
refactor: building within_intron and spliced subsets #134

Open
ns-rse wants to merge 2 commits into main from ns-rse/91-conditional-extraction
Conversation

ns-rse
Contributor

@ns-rse ns-rse commented Jan 10, 2025

Closes #91
Closes #92

Extracts the duplicated code that builds subsets of reads that are within introns and within spliced_3ui, and abstracts it out into functions with tests. Currently the tests fail and I don't think they should, so I need to work out why, and perhaps add some additional fixtures to use here.

@ns-rse ns-rse marked this pull request as draft January 10, 2025 17:13
@ns-rse ns-rse added tests Testing issues refactor Refactoring labels Jan 10, 2025
@ns-rse
Contributor Author

ns-rse commented Jan 13, 2025

Hi @IanSudbery,

I've not yet sussed out why the tests fail, but I have been looking closely at some of the code and logic and have some questions I was wondering if you could help with.

In this PR I've noted...

            # Why add an empty list and append a tuple?
            # Should we not be getting the keys of within_intron to check transcript_id is not in?
            if transcript_id not in within_intron:
                within_intron[transcript_id] = []
            within_intron[transcript_id].append((chromosome, start, end, strand))

...and thinking it through I realised that the reason for this might be that more than one transcript_id can come out of a given pair_feature. If that is the case, the resulting value for a given transcript_id is a list of tuples. The following dummy example demonstrates this:

first = (1, 2, 3, 4)
second = ("a", "b", "c", "d")

test = {}
test["first"] = []
test["first"].append(first)
test["first"].append(second)
test["first"]
# [(1, 2, 3, 4), ('a', 'b', 'c', 'd')]
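As an aside on the check-then-append pattern quoted above, a `collections.defaultdict` could express the same thing without the membership test. This is just a sketch of a possible simplification (with made-up intervals standing in for the real tuples), not the current code:

```python
from collections import defaultdict

# Hypothetical intervals standing in for (chromosome, start, end, strand)
first = ("chr1", 100, 200, "+")
second = ("chr1", 150, 250, "+")

within_intron = defaultdict(list)
for transcript_id, interval in [("tx1", first), ("tx1", second)]:
    # Missing keys are created as empty lists automatically, so no
    # `if transcript_id not in within_intron` check is needed.
    within_intron[transcript_id].append(interval)

print(dict(within_intron))
# {'tx1': [('chr1', 100, 200, '+'), ('chr1', 150, 250, '+')]}
```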

Reading through all_introns_counts_and_info.py I looked for where read1_within_intron / read2_within_intron are
subsequently used and there are two places.

Line 239 - all_dicts

This makes a list of four dictionaries

            all_dicts = [read1_within_intron, read2_within_intron, read1_spliced_3UI, read2_spliced_3UI]

...but all_dicts is never used anywhere (not a problem for now, but we can remove it during refactoring).

Lines 246-259 - assign_conversions_to_retained

The only other place that read1_within_intron / read2_within_intron are used is in building a set of assign_conversions_to_retained on lines 246-259:

            # Create a set of tuples: (tx_id,(start,end))
            # Retained
            assign_conversions_to_retained = []

            for transcript_id, start_end_list in read1_within_intron.items():
                for start_end in start_end_list:
                    assign_conversions_to_retained.append((transcript_id, start_end))

            for transcript_id, start_end_list in read2_within_intron.items():
                for start_end in start_end_list:
                    assign_conversions_to_retained.append((transcript_id, start_end))

            # Only need each unique element once
            assign_conversions_to_retained = set(assign_conversions_to_retained)

The for ... loop here uses .items(), so transcript_id is the key and start_end_list is a list, commonly of length 1 but possibly longer; these are then collected into a list before being de-duplicated.

I'm wondering if we can build the set more directly, so I wanted to check that I've understood the intention correctly: is it to get a set of unique transcript_id values and their associated (chromosome, start, end, strand) tuples?
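If that understanding is right, one way to build the set directly would be a single set comprehension over both dictionaries. A sketch, using toy dictionaries in place of the real read1_within_intron / read2_within_intron:

```python
# Toy dictionaries mimicking read1_within_intron / read2_within_intron;
# note the duplicated interval under "tx1", which the set collapses.
read1_within_intron = {"tx1": [("chr1", 100, 200, "+")]}
read2_within_intron = {"tx1": [("chr1", 100, 200, "+")],
                       "tx2": [("chr1", 300, 400, "+")]}

# Build the set directly, no intermediate list needed
assign_conversions_to_retained = {
    (transcript_id, start_end)
    for d in (read1_within_intron, read2_within_intron)
    for transcript_id, start_end_list in d.items()
    for start_end in start_end_list
}
print(assign_conversions_to_retained)
# {('tx1', ('chr1', 100, 200, '+')), ('tx2', ('chr1', 300, 400, '+'))}
```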

Closes #91
Closes #92

Extracts the duplicated code that builds subsets of reads that are within introns (or span introns) and within
spliced_3ui and abstracts out to functions with tests. Currently the tests fail and I don't think they should so I need
to work out why and perhaps add some additional fixtures to use here.
@ns-rse ns-rse force-pushed the ns-rse/91-conditional-extraction branch from 743b339 to ce1c18d Compare January 13, 2025 13:38
@IanSudbery
Collaborator

Okay, this is all a bit confusing, but...

The XT tag on a read gives the gene_id of the gene to which the read is mapped. A gene can have multiple transcripts.

utrons1/2 contain a list of all the introns associated with the gene in gene_id. If two transcripts contain the same intron, it will appear in those lists twice. Consider the following situation:

                                                 Genome ----------------------->

Read Alignment Blocks                     |>>>>|                     |>>>>>>>>>>>>>>>|
Transcript 1:                      |===========|---------------------|=========|-------------------|==========|
Transcript 2:              |====|------------------------------------|=========|-------------------|==========|

Introns[0]                                      ---------------------
Introns[1]                                                                      -------------------
Introns[2]                       ------------------------------------
Introns[3]                                                                      -------------------

Exon : |========|   Read alignment block:  |>>>>>|

Note that the introns list (which could be utrons1 or utrons2, but is likely to be the same for both) contains four introns.

The loop goes through these lists and for each intron tests whether the current read supports either the splicing or retention of each intron.
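A very rough sketch of how such a per-intron loop could classify a read, purely for illustration; the actual criteria in all_introns_counts_and_info.py may well differ (e.g. tolerance around junction coordinates), and all names here are made up:

```python
# Hypothetical sketch of the classification loop described above.
def classify_read(blocks, introns):
    """blocks: read alignment blocks as sorted (start, end) tuples.
    introns: (transcript_id, intron_start, intron_end) tuples."""
    within_intron, spliced = {}, {}
    for transcript_id, istart, iend in introns:
        # Splicing: the gap between two consecutive blocks matches the intron.
        for (_, b1_end), (b2_start, _) in zip(blocks, blocks[1:]):
            if b1_end == istart and b2_start == iend:
                spliced.setdefault(transcript_id, []).append((istart, iend))
        # Retention: an alignment block lies wholly inside the intron.
        for bstart, bend in blocks:
            if istart <= bstart and bend <= iend:
                within_intron.setdefault(transcript_id, []).append((istart, iend))
    return within_intron, spliced

blocks = [(100, 150), (200, 250), (300, 320)]
introns = [("tx1", 150, 200), ("tx1", 280, 360)]
retained, spliced = classify_read(blocks, introns)
# retained -> {'tx1': [(280, 360)]}, spliced -> {'tx1': [(150, 200)]}
```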

In this case, the final dictionaries will look like:

within_intron = {"transcript1": [introns[1]], "transcript2": [introns[3]] } # read supports retention of introns[1] and introns[3] 
read1_spliced_3UI = {"transcript1": [Introns[0]]}  #  read supports splicing of introns[0]

Note that in theory a read could support the retention (or splicing) of multiple introns within one transcript (although I think it's probably unlikely), so you could get within_intron = {"transcript1": [intron[1], intron[4]]}

In the end, we want data per intron, not per transcript, so where two transcripts contain the same intron, they are eventually collapsed onto the same record (i.e. here we have both introns[1] and introns[3], but we really only need that interval once).
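To illustrate that collapsing step with hypothetical intervals: if the final grouping should be per intron rather than per transcript, keying a set on the interval alone drops the transcript distinction, so the shared interval appears once:

```python
# Hypothetical case: the same genomic interval recorded under two transcripts,
# as with introns[1] and introns[3] in the diagram above.
within_intron = {
    "transcript1": [("chr1", 500, 600, "+")],
    "transcript2": [("chr1", 500, 600, "+")],
}

# Keying only on the interval collapses the two records into one.
unique_intron_intervals = {
    interval
    for intervals in within_intron.values()
    for interval in intervals
}
print(unique_intron_intervals)
# {('chr1', 500, 600, '+')}
```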

@ns-rse
Contributor Author

ns-rse commented Jan 13, 2025

Great, thanks for that explanation @IanSudbery, really useful. I will see about getting this into the documentation (noted in #136).

I've worked out some transcripts that are more useful for the tests in this Pull Request and will be updating the Pull Request in a bit (just finishing them off).

@ns-rse ns-rse marked this pull request as ready for review January 13, 2025 17:25
@ns-rse ns-rse requested review from IanSudbery and jjriley1 January 13, 2025 18:10
Identified instances where meaningful results are obtained when filtering for instances where the start/end are within
introns or there is splicing.

Tests are not fully comprehensive yet, as not all logic statements are covered, in particular when a transcript spans a whole region; I'm currently a bit hazy on whether this will always happen, as the `query_length` is always `150` (`read_length` is longer and was the cause of previous errors).
@ns-rse ns-rse force-pushed the ns-rse/91-conditional-extraction branch from 5f9fed7 to 552d5f3 Compare January 13, 2025 18:18
@ns-rse ns-rse changed the title refactor(wip): building within_intron and spliced subsets refactor: building within_intron and spliced subsets Jan 13, 2025
@ns-rse
Contributor Author

ns-rse commented Jan 13, 2025

Identified instances where meaningful results are obtained when filtering for instances where the start/end are within introns or there is splicing.

Tests are not fully comprehensive yet, as not all logic statements are covered, in particular when a transcript spans a whole region; I'm currently a bit hazy on whether this will always happen, as the query_length is always 150 (read_length is longer and was the cause of previous errors in regression tests).
